ABSTRACT

A report on a project to develop a standard corpus of
present-day Mandarin Chinese is presented. The corpus
consists of words of running text of Chinese prose printed in the
Republic of China during the calendar year 1968. The corpus, although
originally planned to have a total of 500 samples of 2000 words each,
has only 294 samples. Each sample starts at the beginning of a
sentence but not necessarily at the beginning of a paragraph or
larger division. The samples represent a variety of styles of modern
prose, selected for their representative quality rather than for
literary merit. The collection consists primarily of samples from
books and some major periodicals available through the library at the
National Taiwan University and the National Central Library. For each
sample collected, a copy was made and then transcribed into a
modified Pin-yin romanization. For each sample, counts were taken of
the following: names, formulae, figures, foreign strings, foreign
words, words (in total), and syllables. After the samples were
collected and romanized, they were then codified. A manual
accompanies the corpus, which comprises one magnetic tape of about
1,200 feet, available in either 7-track or 9-track mode.

Contents

This standard corpus of present-day Chinese consists
of words of running text of Chinese prose printed in the
Republic of China during the calendar year 1968. Some of
this material has undoubtedly been written earlier, but no
material known to be a second edition or reprint of an
earlier text has been used.

The corpus, although originally planned to have a
total of five hundred samples of 2000 words each, and
fully comparable to the corpus of American English
collected by Professor W. Nelson Francis in 1964, has only
294 samples. Each sample begins at the beginning of a
sentence, but not necessarily at the beginning of a
paragraph or larger division. The samples represent a
variety of styles and varieties of modern prose. As with
the English corpus, verse was not included on the ground
that it presents special linguistic problems different
from those of prose. Drama was excluded for similar
reasons. Fiction was included, but samples with more than
50% of spoken dialogue were excluded, as were those which
consisted of more than 50% Literary Chinese (wen2-yan2).
Samples were chosen for their representative quality
rather than for their literary merit or other qualities.

The project was begun with a conference on the
collection of a corpus held at Brown University on 30
June 1970. Here an attempt was made to establish the
categories and sub-categories of the Chinese Corpus as
contrasted with the English corpus, and a variety of
special details on the collection of the corpus were
specified. In addition, there were a number of specific
suggestions on the coding of input, and on difficulties
which should be avoided.

On arrival in Taiwan, and based on suggestions from
the conference, I made an effort to adjust any implicit
cultural bias of the preliminary list of categories and
the numbers of samples to be included from each, and to
adjust more accurately to the publishing situation in
Taiwan by consulting the publications directory for the
province of Taiwan for the previous year (Chu1-ban3
Shi4-ye4 Deng1-ji4 Yi1 Lan2, Taipei, Ministry of the
Interior, 1968), and the acquisitions lists of the
National Central Library for a period of approximately two
and one-half years. I prepared acquisitions statistics
for all the books and periodicals, separated according to
the Dewey Decimal Classification System. The statistics
for acquisitions in all of these categories were averaged,
and converted to a percentage of the number of samples in a
500-sample corpus.

Then, these numbers were used as guidelines for the
final collection, constituting limits on the number of
samples in each category or sub-category.

The universe from which the samples were chosen was 
further defined as selections from works held in the 
Library of the National Taiwan University, and when the 
holdings of this library proved inadequate or 
inaccessible, further supplemented from the holdings of
the National Central Library, and occasionally from other
sources in the case of ephemera no longer available in the 
larger libraries. 

Generally, we began with the collection of the
samples from books, and some major periodicals which were
available to us through the Library of the National Taiwan
University, attempting to collect samples in each
category. The first step in assuring randomness at this
stage was the preparation of a slip for each book
published in 1967 held by the library, with call number,
author and title. Random number tables were then used in
selecting the book by its call number, treating the call
number as a series of digits, and selecting smaller and
smaller subsets of the whole file with each look-up until
only a single title was left. The sample was defined
using similar procedures with the page numbers.
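
The narrowing procedure can be pictured with a short sketch. It is
only an illustration under stated assumptions, not the procedure as
actually carried out: the call numbers are taken to be distinct,
equal-length digit strings, a pseudo-random generator stands in for
the printed random number tables, and the name pick_by_digits is
hypothetical.

```python
import random

def pick_by_digits(call_numbers, rng):
    """Narrow a file of digit strings one position at a time until one is left."""
    candidates = sorted(set(call_numbers))
    if not candidates:
        raise ValueError("no call numbers supplied")
    width = len(candidates[0])
    if any(len(c) != width or not c.isdigit() for c in candidates):
        raise ValueError("call numbers must be equal-length digit strings")
    position = 0
    while len(candidates) > 1:
        digit = str(rng.randrange(10))  # one look-up in the "table"
        subset = [c for c in candidates if c[position] == digit]
        if subset:  # an empty subset means: look up another digit
            candidates = subset
            position += 1
    return candidates[0]

# Illustrative call numbers only; these are not taken from the library file.
slips = ["0421", "0427", "1530", "8120", "8125"]
chosen = pick_by_digits(slips, random.Random(1968))
```

Each accepted look-up fixes one more digit, so the surviving subset
shrinks until a single title remains. The same routine could be
applied to page numbers to fix the starting point of a sample.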

Unfortunately there proved to be no easy way to
pre-count the words so as to assure that there would be
2000 in each sample, and various expedients were tried,
all imprecise. Therefore, many of the samples fall short
of the 2000-word goal, with those collected earlier
farthest from the mark, where we had used the number of
characters in the sample as guide. Later, with
experience, we learned to have the transcribers count the
total number of lines of text transcribed instead, with
consistently better results.

For each sample selected, a copy was made on locally
available copying equipment, since many of the materials
were not circulable, and the copy was given to a typist
for transcription into a modified Pin-yin romanization.
Romanization was chosen as the vehicle for transcription
only partly because of its increasing acceptance within
the Chinese culture.

The most important reason for the choice of this
romanization rested in the assumption that the text would
be used by Chinese and Westerners familiar with Chinese,
and that they would be more comfortable reading romanized
text than reading material transcribed into one of the
versions of the telegraphic code, or into some specialized
machine coding. This choice was made with the awareness
that some ambiguity would remain because different words
would be spelled in the same way. We assumed that the
literate user could disambiguate such forms from the
context, and made special efforts in the coding to
minimize homonymy.

The samples were first transcribed, the transcription
was proof-read, then the proof-read and edited copy was
keypunched, and the cards checked for errors.

After the cards had been read onto tape, the sample
was again checked by machine with a simple program
designed to check for the two commonest errors: absence of
a tone-mark on any syllable, and the insertion of a space
in the middle of a word, caused when the keypunch operator
left a space at the end of the card at the end of a
syllable that did not finish a word. When these were
found, corrections were made by inserting a new,
corrected, card.
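
The first of these checks is easy to sketch. The fragment below is
not the checking program used on the project; it is a minimal
illustration that assumes syllables within a word are joined by a
hyphen, plus sign, or asterisk, that tone digits run from 1 to 9, and
that a little ordinary punctuation may cling to a token. The name
untoned_syllables is hypothetical.

```python
import re

# A well-formed syllable: letters followed by exactly one tone digit (1-9).
SYLLABLE = re.compile(r"[A-Za-z]+[1-9]")
CONNECTORS = re.compile(r"[-+*]")
PUNCTUATION = ".,;:?\"'()"  # assumed; the corpus may code punctuation differently

def untoned_syllables(line):
    """Return the alphabetic pieces of a coded line that lack a final tone digit."""
    flagged = []
    for token in line.split():
        for piece in CONNECTORS.split(token):
            piece = piece.strip(PUNCTUATION)
            if piece and piece[0].isalpha() and not SYLLABLE.fullmatch(piece):
                flagged.append(piece)
    return flagged

suspects = untoned_syllables("ta1 shi4 zhong1-guo ren2.")
```

A check of this kind flags every letter string without a tone digit,
so legitimate untoned items (single letters used as labels, for
example) are reported along with genuine omissions and must be
dismissed by a human editor.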

For each sample, counts were made for the following
items: Names, undifferentiated as to whether they are
personal or place names; Formulae; Figures; Foreign
Strings, which are separate occurrences of one or more
foreign words; and Foreign Words, the total of the foreign
words listed in the separate strings; Words, a total of
all of the separate graphically defined words; followed by
a listing of total syllables in the sample, with two
subsets of this last: one being syllables preceded by a
plus sign; and the other syllables preceded by an
asterisk, which distinctions have grammatical
significance, as noted in the section on Coding
Conventions, below. For each sample, the list of counts
follows the bibliographic data.

In May, 1974 the total counts are: Names, 25,928;
Formulae, 69; Figures, [illegible]; Foreign Strings, 2,137;
Foreign Words, [illegible]; Words, [illegible]; Syllables,
1,070,932; +Syllables, 86,037; *Syllables, [illegible].

The coding procedures that allow retrieval of the
many classes of items such as names and formulae introduce
some features that are inconsistent with the requirement
for the unity of the graphic word. For example, the
modification particle '-de5' should be immediately linked
to the previous syllable. However, when it is preceded by
a word which is a personal or place name, coding
conventions come into conflict, and the requirement that
the symbols which designate a name be separated from
surrounding text by a blank space takes precedence in the
ordered set of coding rules. Since some of these may be
potentially confusing to the user, two separate lists have
been prepared which reflect two different levels of
specificity with respect to the scale of inconsistency.

The first, which, according to more economical
standards of definition, may reflect coding errors at the
more extreme end of the inconsistency scale, is appended
directly to the bibliographic description of the sample
itself, and is headed 'Possible Errors'. This list is
presented in two columns in the text of the manual, with
the column to the left representing the offending item,
and the context in which it occurs shown in the column to
the right, with the context displayed with ten spaces to
the left and right of the item, while the item itself is
represented by an underscore. The ten-space limitation is
usually effective in delimiting the context so that the
user can judge whether a real error has been made, but
also usually offends the eye and the sense in that it
breaks the majority of both preceding and following words
in such a way as to be aesthetically displeasing. Many of
the items in these lists represent faithful recognition of
repetitive errors not taken full account of in the initial
coding. Specifically, of the items listed as 'Possible
Errors' for the sample A03, most were collected because
the programs we used specified that a syllable not
followed by a numeral indicating a full or neutral tone
was to be considered as a 'Possible Error'. The list
there presented was machine selected because the letters
'A', 'B', 'C', and 'D' in the combinations 'Group A',
'Group B', 'Group C', and 'Group D' were, properly, not
followed by tone-indicating digits.
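
The two-column layout of a 'Possible Errors' entry can be imitated
in a few lines. This is a sketch only, assuming the context is cut at
exactly ten characters on either side regardless of word boundaries;
the function name possible_error_row and the sample text are
hypothetical.

```python
def possible_error_row(text, start, end, width=10):
    """Return (item, context): the item, and its context with the item as '_'."""
    item = text[start:end]
    left = text[max(0, start - width):start]
    right = text[end:end + width]
    return item, left + "_" + right

# Illustrative line; not quoted from any sample in the corpus.
line = "zai4 Group A zhong1 de5"
start = line.index(" A ") + 1
item, context = possible_error_row(line, start, start + 1)
```

Here the context comes out as "ai4 Group _ zhong1 de", which shows
how a fixed character window cuts through the words on either side of
the item.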

Similarly, the long listing of 'Possible Errors' for
sample G[illegible] reflects assiduous machine concern with the
fact that a similar lack of tone tagging is reflected on
each repetition of each letter in the frequent citations
in this sample of the drug LSD. The human editor can deal
with this anomaly lightly and dismiss it as of no concern,
but lack of such formalized machine procedures may cause
the user to overlook such real errors as that in A09 or
the many in J02 in which the omission of a space preceding
or following an item may well prevent its inclusion in a
concordance until it has been corrected. On the whole it
seemed better to leave this level of 'Possible Error'
associated directly with the text and its description.

For the second list, which seemed on inspection to
represent a lower level of real error, and a higher
incidence of coding anomalies, or of coding rules in
conflict with respect to their relative rank, it seemed
more efficient to relegate the much longer list to an
appendix listed according to sample, and it is there so
presented.



Problems 

Initially it was found that there were very few
individuals who were willing to work on a short-term
project of such magnitude. Additionally there was a
problem of finding individuals whose typing skills were
adequate for rapid conversion from Chinese characters into
romanized text and whose dialect was sufficiently standard
to assure that they could represent Chinese in a standard
romanization. Dialect variance was shockingly
wide-ranging. As an initial partial solution, a single
typist was hired as a full-time secretary-transcriber;
she had had considerable experience in English typing,
spoke standard Mandarin, and had worked with romanized
Mandarin previously. Over an initial few months it
developed that there were both personal and institutional
problems in requiring individuals to do careful
transcription over long periods during any working day.
As the problems became clearer, it became apparent that
detailed, accurate transcription of technical and literary
material of this type is so intense that it could not
be done full-time by any single individual for very many
hours per day. We experimented briefly with two or three
part-time typists, who were personally recommended for
their skills, but found both quality and output to be low.
At this point we began the search for competent,
experienced, careful, quality-oriented typist-transcribers
who could handle the range of material that we had chosen.
It turned out that no one was interested in working
full-time and that we should put our efforts into
recruiting a larger number of part-time
typist-transcribers. By this time the first 6 months of
the initial contract period had passed with only limited
production. It was clear that any attempt to use a large
number of part-time typist-transcribers would require much
more clearly described definition of tasks, expectations,
and standards for work than had so far been described.
Therefore, I turned my efforts to developing a programmed
outline to teach transcription from Chinese characters
into Romanized Chinese in the Pin-yin system. The program
begins with simple consonants, and increases in
complexity, progressing to vowels, syllables, the tonal
notation adapted for the input of unstressed syllables;
then to the techniques for joining syllables into words,
and for indicating bound forms and similar grammatical
adjuncts and finally to the special codes for place and
personal names.

Directions to the program instruct the user to use
the self-checking features in the program; and by the time
the program had been completed the user could not only use
the system with some facility, but had learned to use the
program itself as a reference device in the later stages
of learning, and when they began work on text samples.
This program proved to be immensely successful, and
simplified all later activity.

At an appropriate time in the development of this
program for teaching edited Pin-yin romanization, we
inserted an advertisement in three major papers to run for
three days. The advertisement was for English typists
with experience in romanization who wished to work
part-time, and instructed them to send a vita sheet to a
Post Office box. It further requested that anyone
interested send in a romanization of the advertisement
itself. This last feature was helpful and revealing in
that it pre-selected those who had both typing and
romanization skills, and were willing to put the effort
into the transcription. We had more than two hundred
replies, more than a hundred of which indicated that they
had not understood the specifications. The rest were
divided into three groups: a. A group which had submitted
typed responses which minimally indicated access to a
typewriter. All had good control of typing skills, and
showed that they had mastered at least one romanization
for Chinese. There were about thirty-five names in this
group. b. A second lot which had poorer typing and
romanization, and c. All others who had tried to fulfill
the requirements, but simply could not perform
effectively.

Those in the first group were rank-ordered on a scale
which put primary emphasis on accuracy of typing and
correctness of romanizations. Neatness was judged only
insofar as legibility was at issue. Clean erasures were
accepted, but if numerous, served to downrank the
individual.

The first twenty of these on the list were approached
with a copy of the program developed to teach the Pin-yin
romanization, and a sample to be coded. A few of these,
on seeing the complexity of the task, indicated that they
had no interest in such work, but almost all attempted to
do one sample. They were told that the first samples
would be paid at a premium rate, the second at a slightly
lower rate, and all others at a flat rate, and that after
the second sample was completed the rate would be reduced
for evidence of error, so that a premium was put on
accuracy. At the end of the first sample, it was clear
that some were admirably suited to this work, and they
were continued, and no contact was made with the others on
the list. We ended up with about fifteen able
transcribers whose accuracy rate was surprisingly high. It
was these on whom we relied for all later transcriptions.



Coding Conventions 

The texts are romanized using the Pin-yin system with
a number representing the tone of the syllable
incorporated at the end of each syllable, and 'v'
representing ü. In addition to the digits one through
four to indicate the four tones of Mandarin, and the use
of five to indicate the tone of characters that are
normally neutral (qing1-yin1), four more 'tones' are also
distinguished. Characters which are normally first tone,
but in context are read as neutral are coded with '6';
neutral tones normally having the second tone are coded
with '7'; and similarly with the characters normally
having the third and fourth tone. For purposes of
retrieval and disambiguation, each tone from 1 to 4 is
represented either by this number, or by this number plus
five. In each case, the tone follows the romanization
immediately with no intervening space.
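
The tone digit can be unpacked mechanically. The sketch below
assumes, as the plus-five rule implies, that the digits 6 through 9
stand for tones 1 through 4 read as neutral in context; the function
name decode_tone is hypothetical.

```python
def decode_tone(syllable):
    """Split a coded syllable into (spelling, underlying tone, read as neutral)."""
    if len(syllable) < 2:
        raise ValueError("not a toned syllable: %r" % syllable)
    spelling, digit = syllable[:-1], syllable[-1]
    if not spelling.isalpha() or digit not in "123456789":
        raise ValueError("not a toned syllable: %r" % syllable)
    code = int(digit)
    if code <= 4:
        return spelling, code, False     # full tone, read as such
    if code == 5:
        return spelling, 5, True         # normally neutral
    return spelling, code - 5, True      # tone 1-4 neutralized in context

full = decode_tone("bu4")
neutralized = decode_tone("de7")
```

With this coding a retrieval program can treat 'de2' and 'de7' as the
same underlying syllable simply by reducing the digit, or keep them
apart when the distinction matters.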

Some grammatical and formatting information was also
coded, using special identification marks. In considering
frequency and word distribution, it is assumed that the
graphic word rather than the single character in isolation
is the unit to be considered. Therefore, words of more
than one character are coded by connecting syllables with
a hyphen, with no spaces between the syllables. Such coding
will permit the collection of data on homophones, but for
recovery purposes will also serve to limit the number of
homophonous 'words'. The standard usually applied is that
of 'native-speaker intuition', and may result in some
discrepancies in coding, although an effort has been made
in the editing process to reduce these discrepancies.

Other connecting symbols are also used. When
ordinarily 'free' syllables are formed into elements of
words, such as the bu4 and the de7 in resultative
compounds, the syllables in such a compound are connected
by the plus sign. When the syllable is bound to the
sentence as a whole rather than the word, the connecting
symbol is the asterisk.
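
A program reading the corpus will usually need to take a graphic
word apart at its connecting symbols. The following sketch assumes
the three connectors are the hyphen, the plus sign, and the asterisk,
and counts the syllables that follow each; the function names and the
example line are hypothetical.

```python
import re

def split_word(word):
    """Return (connector, syllable) pairs; the first syllable has connector ''."""
    parts = re.split(r"([-+*])", word)
    pairs = [("", parts[0])]
    for i in range(1, len(parts), 2):
        pairs.append((parts[i], parts[i + 1]))
    return pairs

def syllable_counts(line):
    """Count all syllables, and those preceded by a plus sign or an asterisk."""
    counts = {"syllables": 0, "plus": 0, "asterisk": 0}
    for word in line.split():
        for connector, syllable in split_word(word):
            if not syllable:
                continue
            counts["syllables"] += 1
            if connector == "+":
                counts["plus"] += 1
            elif connector == "*":
                counts["asterisk"] += 1
    return counts

# Illustrative line; not quoted from the corpus.
totals = syllable_counts("wo3 kan4+bu4+jian4 zhong1-guo2 hao3*le5")
```

Counts of this kind correspond to the syllable totals and their two
subsets reported for each sample.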

Personal and place names, as well as most of the
format codes, have a single identifying symbol preceding
the item, and a double symbol following the item. In each
case the single or repeated symbol is both preceded and
followed by a blank space to facilitate retrieval.

The following special codes are used for these
purposes:
Personal and place names: @...@@.
Chapter headings: [code illegible].
Interlinear annotations: $...$$.
Footnotes and other annotations: !...!!.
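
Because every such item opens with a single symbol and closes with
the same symbol doubled, each set off by blanks, one routine can
retrieve any class of item. The sketch below is an illustration
under that reading of the convention, shown with the dollar sign used
for interlinear annotations; the name coded_items and the example
line are hypothetical.

```python
import re

def coded_items(line, symbol):
    """Return the contents of every 'symbol ... symbolsymbol' item in a line."""
    opener = re.escape(symbol)
    closer = re.escape(symbol * 2)
    # The opening symbol must start the line or follow a blank; the closing
    # pair must end the line or be followed by a blank.
    pattern = re.compile(rf"(?:^|(?<= )){opener} (.+?) {closer}(?= |$)")
    return [match.group(1) for match in pattern.finditer(line)]

# Illustrative line; not quoted from the corpus.
notes = coded_items("ta1 shuo1 $ jian4 shang4 $$ hao3", "$")
```

Passing a different symbol retrieves a different class of item with
the same routine, which is the practical benefit of surrounding the
codes with blanks.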

Since modern Chinese prose often cites foreign words
or phrases which cannot be coded according to the
conventions for Chinese material, material in this form is
coded within parentheses as follows: the per cent symbol
preceded and followed by a blank, followed by a
two-letter code for the language, followed by a period,
which is then followed by a number indicating the number
of words in the citation. The number is followed by a
blank. An example: ( % IT. 6 ) represents the occurrence
of six words in Italian.
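
The Foreign Strings and Foreign Words counts can be recovered
directly from these markers. The sketch below follows the printed
example, allowing optional space between the period and the number;
the function name foreign_counts and the example line are
hypothetical.

```python
import re

# "( % IT. 6 )": language code, period, word count, all inside parentheses.
FOREIGN = re.compile(r"\( % ([A-Z]{2})\.\s*(\d+) \)")

def foreign_counts(line):
    """Return (number of foreign strings, total number of foreign words)."""
    matches = FOREIGN.findall(line)
    return len(matches), sum(int(count) for _, count in matches)

# Illustrative line; not quoted from the corpus.
strings, words = foreign_counts("ta1 shuo1 ( % IT. 6 ) hao3 ( % IT. 2 )")
```

In this example there are two foreign strings containing eight
foreign words in all.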

In addition, certain non-language data often occurs 
in Chinese. For this corpus, coding for figures, graphs, 
and charts and for equations and formulae is adequate. 
Each of these is coded by prefixing a double asterisk 
within parentheses both before and after the equation or 
figure, with a space before and after the double asterisk. 
The code for figures, graphs, or charts uses the capital
letters 'FG', and that for equations or formulae is the
capital letters 'FM'. Thus the presence of a figure or
chart in the text is coded as ( ** FG ) and that for a
formula or equation is ( ** FM ).

In order to save time in coding the original samples,
and provide a psychological 'out' for a coder who might
not be willing to admit to inability to read some item,
they were instructed to type out a sequence of seven X's
when they were unsure of a romanization or reading for a
character (XXXXXXX). Although attempts have been made to
locate and give proper readings for such characters, this
will remain the symbol for a character whose reading has
not yet been determined for the specific context.

The first 70 columns of each card contain the text,
column 71 is left blank, and columns 72 through 80 contain
a location marker. The coding of the text in columns 1
through 70 is that described in the previous pages of this
chapter. The location marker contains a line number in
columns 72 through 75, the characters 'C1' in columns 76
and 77 (which identify the corpus as the first Chinese
corpus, in analogy with the English 'E1' and the Russian
'R1'), and a sample marker (78-80) consisting of a letter
corresponding to the classification of the sample as to
genre, and a two-digit number (between 01 and 99)
identifying the sample uniquely within that genre.
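
The card layout translates into simple column slicing. The sketch
below assumes that the sample marker occupies the three columns left
over after the line number and 'C1', that is, columns 78 through 80,
and that the card image has already been read as an 80-character
string; the names Card and parse_card are hypothetical.

```python
from typing import NamedTuple

class Card(NamedTuple):
    text: str          # columns 1-70
    line_number: int   # columns 72-75
    corpus: str        # columns 76-77, expected to be "C1"
    genre: str         # column 78, genre letter
    sample: int        # columns 79-80, sample number within the genre

def parse_card(image):
    """Split one 80-column card image into text and location marker."""
    if len(image) != 80:
        raise ValueError("a card image must be exactly 80 characters")
    return Card(
        text=image[0:70],
        line_number=int(image[71:75]),
        corpus=image[75:77],
        genre=image[77],
        sample=int(image[78:80]),
    )

# Illustrative card image; not copied from the tape.
image = "zhong1-guo2 ren2".ljust(70) + " " + "0012" + "C1" + "A03"
card = parse_card(image)
```

Column numbers in the description count from 1, while the slices
count from 0, so column 72 is index 71.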

For purposes of the word distribution studies the 
graphic word is considered the basic unit and its 
integrity is never jeopardized. Therefore, if a word ends 
in column 70, column 1 of the next card is left blank and 
the next word is punched starting in column 2 without an 
intervening hyphen. If a word ends in column 69, column 
70 is left blank and punching continues, starting in
column 1 of the next card. 

To facilitate the correction of proofread cards two
conventions were adopted which are of particular interest
to a programmer using the corpus: 1) a string of blanks
greater than one is equivalent to a single blank and 2)
the character [illegible] is to be interpreted as "delete me and all
the following blanks, if any occur". (The English uses ♦
for this).
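
Taken together, the wrapping rule and the two correction conventions
suggest how running text might be rebuilt from the cards. The sketch
below rests on several assumptions: the 70-column text fields are
simply laid end to end, the delete character is supplied by the
caller (the '#' in the example is a placeholder, not the character
actually used in the corpus), and the name join_cards is
hypothetical.

```python
import re

def join_cards(text_fields, delete_char):
    """Concatenate 70-column text fields and apply the two blank conventions."""
    # Pad or trim every field to 70 columns so a word broken at column 70
    # continues directly in column 1 of the next card.
    raw = "".join(field.ljust(70)[:70] for field in text_fields)
    # Convention 2: drop the delete character and any blanks that follow it.
    raw = re.sub(re.escape(delete_char) + r" *", "", raw)
    # Convention 1: a run of blanks counts as a single blank.
    return re.sub(r" {2,}", " ", raw).strip()

# A correction card that stops in mid-word, followed by its continuation.
fields = ["ta1 shi4 zhong1-#", "guo2  ren2"]
running_text = join_cards(fields, "#")
```

The delete character has to be removed before blanks are collapsed;
otherwise the blanks after it would be reduced to one and then
survive as an unwanted space inside the word.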



Records 

A number of record-keeping errors have resulted in
the loss of certain data for particular samples. In the
manual, each of these is identified at the place at which
the sample number occurs. Since these notifications are
inserted in the text of the manual using the imbed
instruction available with NSCRIPT, alterations and
updating will be possible as soon as the information
becomes available. Users should request the latest [...]

The Manual was prepared by Misses Emily Calkins and
Vicky Williams under the direction of Mr. Gerald Rubin of
Brown University, using the text-editing program NSCRIPT.
A text-editing program was chosen so that as corrections
or additions to the manual are required they can be added
to later editions with relative ease.

Since the manual is not intended to be part of the
corpus itself, some of the conventions used in the manual
differ from those in the corpus. Place names are
romanized in Pin-yin, but without tones, except where the
place is known by a more familiar romanization, or is a
representation of a foreign word.

Thus, Xiang1-gang3 is 'Hongkong', and Luo2-si1-fu2 is
'Roosevelt'. Designations for streets and roads and other
locations are translated rather than romanized, and where
this might be ambiguous, as with the terms for 'village',
the designation is followed by the romanization of the
Chinese character in parentheses, as 'Panchiao Village'
(zhen4), and 'Kuanghua New Village' ([illegible]).



Basic Technical Information

The Standard Chinese corpus is available to
interested scholars. It comprises one magnetic tape of
about 1200 feet in length. The Corpus is available in
either 7-track or 9-track mode.

The organization of the data on tape corresponds to
the punched card format described above. The data is
recorded on tape in card-image form; all cards
(including correction cards and other incompletely filled
cards) are reproduced in their entirety. However, for
reasons of operating efficiency in processing the Corpus
data, the card-image records are grouped on tape by a
blocking factor of 40, so that the tape is composed of a
series of fixed-length tape records each containing 3,200
characters (i.e., 40 cards). Since sample size and tape
record length are independent of each other, the end of a
sample does not necessarily coincide with the end of a
tape record.
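
Unblocking a tape record is a matter of cutting it into 80-character
pieces. The sketch below assumes the record has already been read and
converted to an ordinary character string of 3,200 characters; how
the tape itself is read and decoded depends on the installation and
is not shown. The name cards_in_record is hypothetical.

```python
CARD_LENGTH = 80
BLOCKING_FACTOR = 40

def cards_in_record(record):
    """Split one fixed-length tape record into its 40 card images."""
    expected = CARD_LENGTH * BLOCKING_FACTOR  # 3,200 characters
    if len(record) != expected:
        raise ValueError("expected %d characters, got %d" % (expected, len(record)))
    return [record[i:i + CARD_LENGTH] for i in range(0, expected, CARD_LENGTH)]

# A dummy record: card n is filled with the digit n modulo 10.
record = "".join(str(n % 10) * CARD_LENGTH for n in range(BLOCKING_FACTOR))
cards = cards_in_record(record)
```

Because a sample may end in the middle of a record, a program should
detect sample boundaries from the sample marker on each card rather
than from the position of the card within a record.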



Availability 



Both the manual and the Corpus are available from the
Department of Linguistics, Brown University, Providence,
Rhode Island 02912. The Corpus can be ordered in either
7-track or 9-track format and at several recording
densities. Purchasers may send their own tapes to be
copied onto for a service and handling charge of $50.00,
or they may request that copied tapes be furnished at a
charge of $75.00. Both these prices include one copy of
the manual. Additional copies of the manual can be
purchased for $10.00.